648 research outputs found
S4Net: Single Stage Salient-Instance Segmentation
We consider an interesting problem, salient instance segmentation, in this
paper. Other than producing bounding boxes, our network also outputs
high-quality instance-level segments. Taking into account the
category-independent property of each target, we design a single stage salient
instance segmentation framework, with a novel segmentation branch. Our new
branch regards not only local context inside each detection window but also its
surrounding context, enabling us to distinguish the instances in the same scope
even with obstruction. Our network is end-to-end trainable and runs at a fast
speed (40 fps when processing an image with resolution 320x320). We evaluate
our approach on a publicly available benchmark and show that it outperforms
other alternative solutions. We also provide a thorough analysis of the design
choices to help readers better understand the functions of each part of our
network. The source code can be found at
https://github.com/RuochenFan/S4Net
SLS4D: Sparse Latent Space for 4D Novel View Synthesis
Neural radiance field (NeRF) has achieved great success in novel view
synthesis and 3D representation for static scenarios. Existing dynamic NeRFs
usually exploit a locally dense grid to fit the deformation field; however,
they fail to capture the global dynamics and concomitantly yield models of
heavy parameters. We observe that the 4D space is inherently sparse. Firstly,
the deformation field is spatially sparse but temporally dense due to the
continuity of motion. Secondly, the radiance field is only valid on the
surface of the underlying scene, usually occupying a small fraction of the
whole space. We thus propose to represent the 4D scene using a learnable sparse
latent space, a.k.a. SLS4D. Specifically, SLS4D first uses dense learnable time
slot features to depict the temporal space, from which the deformation field is
fitted with linear multi-layer perceptrons (MLPs) to predict the displacement of
a 3D position at any time. It then learns the spatial features of a 3D position
using another sparse latent space. This is achieved by learning the adaptive
weights of each latent code with the attention mechanism. Extensive experiments
demonstrate the effectiveness of our SLS4D: it achieves the best 4D novel view
synthesis using only a fraction of the parameters of the most recent work.
Comment: 10 pages, 6 figures
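The attention step described in the SLS4D abstract (adaptive weights over latent codes) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of scaled dot-product attention, and the toy vectors are all assumptions made for illustration, and in practice the queries, keys and values would be learned.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_latent_codes(query, keys, values):
    """Blend latent codes with attention weights (hypothetical helper).

    query  : feature vector derived from a 3D position (assumed given)
    keys   : one key vector per latent code
    values : one value vector per latent code
    Returns the attention weights and the weighted sum of the values.
    """
    scale = math.sqrt(len(query))
    # Similarity between the query and each latent code's key
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted sum of value vectors gives the spatial feature
    feature = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, feature

weights, feature = attend_latent_codes(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[2.0, 0.0], [0.0, 2.0]],
)
```

With two identical keys, both latent codes receive equal weight, so the resulting feature is the average of the two value vectors.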
MonoNeuralFusion: Online Monocular Neural 3D Reconstruction with Geometric Priors
High-fidelity 3D scene reconstruction from monocular videos continues to be
challenging, especially for complete and fine-grained geometry reconstruction.
The previous 3D reconstruction approaches with neural implicit representations
have shown a promising ability for complete scene reconstruction, while their
results are often over-smoothed and lack sufficient geometric detail. This paper
introduces a novel neural implicit scene representation with volume rendering
for high-fidelity online 3D scene reconstruction from monocular videos. For
fine-grained reconstruction, our key insight is to incorporate geometric priors
into both the neural implicit scene representation and neural volume rendering,
thus leading to an effective geometry learning mechanism based on volume
rendering optimization. Benefiting from this, we present MonoNeuralFusion to
perform the online neural 3D reconstruction from monocular videos, by which the
3D scene geometry is efficiently generated and optimized during the on-the-fly
3D monocular scanning. The extensive comparisons with state-of-the-art
approaches show that our MonoNeuralFusion consistently generates much better
complete and fine-grained reconstruction results, both quantitatively and
qualitatively.
Comment: 12 pages, 12 figures
Semantic-Aware Transformation-Invariant RoI Align
Great progress has been made in learning-based object detection methods in
the last decade. Two-stage detectors often have higher detection accuracy than
one-stage detectors, due to the use of region of interest (RoI) feature
extractors which extract transformation-invariant RoI features for different
RoI proposals, making refinement of bounding boxes and prediction of object
categories more robust and accurate. However, previous RoI feature extractors
can only extract invariant features under limited transformations. In this
paper, we propose a novel RoI feature extractor, termed Semantic RoI Align
(SRA), which is capable of extracting invariant RoI features under a variety of
transformations for two-stage detectors. Specifically, we propose a semantic
attention module to adaptively determine different sampling areas by leveraging
the global and local semantic relationship within the RoI. We also propose a
Dynamic Feature Sampler which dynamically samples features based on the RoI
aspect ratio to enhance the efficiency of SRA, and a new position embedding,
i.e., Area Embedding, to provide more accurate position information for SRA
through an improved sampling area representation. Experiments show that our
model significantly outperforms baseline models with slight computational
overhead. In addition, it shows excellent generalization ability and can be
used to improve performance with various state-of-the-art backbones and
detection methods.
PCT: Point cloud transformer
The irregular domain and lack of ordering make it challenging to design deep
neural networks for point cloud processing. This paper presents a novel
framework named Point Cloud Transformer (PCT) for point cloud learning. PCT is
based on Transformer, which has achieved huge success in natural language
processing and displays great potential in image processing. It is inherently
permutation invariant for processing a sequence of points, making it
well-suited for point cloud learning. To better capture local context within
the point cloud, we enhance input embedding with the support of farthest point
sampling and nearest neighbor search. Extensive experiments demonstrate that
PCT achieves state-of-the-art performance on shape classification, part
segmentation and normal estimation tasks.
Comment: 11 pages, 5 figures
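The PCT abstract mentions farthest point sampling and nearest neighbor search for capturing local context. The sketch below shows the textbook versions of both operations in plain Python. It is an illustration under simplifying assumptions (brute-force distances, points as tuples, hypothetical function names), not the PCT code itself.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Pick k point indices, each new one farthest from those already chosen."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [start]
    # min_dist[i] is the distance from point i to its nearest chosen point
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        # The next sample is the point with the largest distance to the chosen set
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen

def k_nearest_neighbors(points, center_idx, k):
    """Indices of the k points closest to points[center_idx], itself included."""
    center = points[center_idx]
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], center))
    return order[:k]

cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
centers = farthest_point_sampling(cloud, k=3)
groups = [k_nearest_neighbors(cloud, c, k=2) for c in centers]
```

Sampling spreads the chosen centers across the cloud, and the neighbor search then gathers a small local group around each center, which is the kind of neighborhood a local embedding could aggregate over.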
Attention Mechanisms in Computer Vision: A Survey
Humans can naturally and effectively find salient regions in complex scenes.
Motivated by this observation, attention mechanisms were introduced into
computer vision with the aim of imitating this aspect of the human visual
system. Such an attention mechanism can be regarded as a dynamic weight
adjustment process based on features of the input image. Attention mechanisms
have achieved great success in many visual tasks, including image
classification, object detection, semantic segmentation, video understanding,
image generation, 3D vision, multi-modal tasks and self-supervised learning. In
this survey, we provide a comprehensive review of various attention mechanisms
in computer vision and categorize them according to approach, such as channel
attention, spatial attention, temporal attention and branch attention; a
related repository https://github.com/MenghaoGuo/Awesome-Vision-Attentions is
dedicated to collecting related work. We also suggest future directions for
attention mechanism research.
Comment: 27 pages, 9 figures
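The survey describes attention as a dynamic weight adjustment process based on features of the input. A simplified channel-attention sketch in the squeeze-and-excitation style makes that idea concrete. The single linear layer, the hand-picked weight matrix, and the function name are assumptions for illustration. Real modules learn their weights and typically use a two-layer bottleneck.

```python
import math

def channel_attention(feature_map, weights):
    """Reweight channels by gates computed from the input itself.

    feature_map : list of C channels, each a 2D list of floats
    weights     : C x C matrix (hypothetical; learned in a real network)
    Returns the per-channel gates and the rescaled feature map.
    """
    # Squeeze: global average pooling reduces each channel to one number
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excite: a linear layer plus sigmoid yields one gate in (0, 1) per channel
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * p for w, p in zip(row, pooled))))
        for row in weights
    ]
    # Reweight: scale every value in a channel by that channel's gate
    scaled = [[[g * v for v in row] for row in ch] for ch, g in zip(feature_map, gates)]
    return gates, scaled

fmap = [
    [[1.0, 1.0], [1.0, 1.0]],  # an active channel
    [[0.0, 0.0], [0.0, 0.0]],  # a silent channel
]
identity = [[1.0, 0.0], [0.0, 1.0]]
gates, scaled = channel_attention(fmap, identity)
```

Because the gates depend on the pooled input, a different image produces different channel weights, which is what distinguishes attention from a fixed per-channel scaling.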
Learning virtual view selection for 3D scene semantic segmentation
2D-3D joint learning is essential and effective for fundamental 3D vision tasks, such as 3D semantic segmentation, due to the complementary information these two visual modalities contain. Most current 3D scene semantic segmentation methods process 2D images “as they are”, i.e., only real captured 2D images are used. However, such captured 2D images may be redundant, with abundant occlusion and/or limited field of view (FoV), leading to poor performance for the current methods involving 2D inputs. In this paper, we propose a general learning framework for joint 2D-3D scene understanding by selecting informative virtual 2D views of the underlying 3D scene. We then feed both the 3D geometry and the generated virtual 2D views into any joint 2D-3D-input or pure 3D-input based deep neural models for improving 3D scene understanding. Specifically, we generate virtual 2D views based on an information score map learned from the current 3D scene semantic segmentation results. To achieve this, we formalize the learning of the information score map as a deep reinforcement learning process, which rewards good predictions using a deep neural network. To obtain a compact set of virtual 2D views that jointly cover informative surfaces of the 3D scene as much as possible, we further propose an efficient greedy virtual view coverage strategy in the normal-sensitive 6D space, including 3-dimensional point coordinates and 3-dimensional normal. We have validated our proposed framework for various joint 2D-3D-input or pure 3D-input based deep neural models on two real-world 3D scene datasets, i.e., ScanNet v2 and S3DIS, and the results demonstrate that our method obtains a consistent gain over baseline models and achieves new top accuracy for joint 2D and 3D scene semantic segmentation. Code is available at https://github.com/smy-THU/VirtualViewSelection
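The greedy virtual view coverage strategy mentioned in the abstract above can be illustrated with a generic greedy set-cover sketch. This is a simplification under stated assumptions: each candidate view is represented by the set of surface cells it covers (for example, quantized position-and-normal bins), the view ids and cell ids are hypothetical, and the learned information score map from the paper is not modeled.

```python
def greedy_view_selection(view_coverage, max_views):
    """Greedily pick views that add the most uncovered surface cells.

    view_coverage : dict mapping a view id to the set of cells it covers
    max_views     : upper bound on the number of views to select
    Returns the chosen view ids (in order) and the set of covered cells.
    """
    covered, chosen = set(), []
    remaining = dict(view_coverage)
    while remaining and len(chosen) < max_views:
        # Choose the view with the largest marginal coverage gain
        best = max(remaining, key=lambda v: len(remaining[v] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # no remaining view adds anything new
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered

views = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}, "d": {5}}
chosen, covered = greedy_view_selection(views, max_views=4)
```

In this toy input, view "c" is never selected because everything it sees is already covered by "a", so the loop stops early with a compact set of three views.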